Video-based action recognition security system

ABSTRACT

A video monitoring system and method are provided. The video monitoring system includes a camera. The camera is positioned to monitor an area and capture live video to provide a live video stream. The video monitoring system also includes a security processing system. The security processing system includes a processor and memory coupled to the processor. The security processing system is programmed to detect and identify a target action sequence in the live video stream using a multi-layer deep long short-term memory process on an attention factor that is based on a within-frame attention and a between-frame attention. The security processing system is further programmed to trigger an action to alert that a target action sequence has been detected.

RELATED APPLICATION INFORMATION

This application claims priority to U.S. Provisional Patent Application Ser. No. 62/318,865 filed on Apr. 6, 2016, incorporated herein by reference in its entirety. Moreover, this application is related to commonly assigned U.S. patent application Ser. No. TBD (Attorney Docket Number 15104A), filed concurrently herewith and incorporated herein by reference.

BACKGROUND

Technical Field

The present invention generally relates to video-based recognition and more particularly to video-based action recognition in a monitoring system.

Description of the Related Art

Video-based action recognition is a key component of intelligent monitoring systems for many applications, such as public safety monitoring, shopping center and factory surveillance, and home security. Real-time action recognition based on video sequences produced by surveillance cameras not only detects the type of action of interest, but also detects the start and end of the searched action, which often contains a sequence of action progression stages or sub-actions, as well as the most relevant time-dependent regions within video frames.

Previous approaches to action recognition mainly fall into the following two categories: A) feature engineering based on individual video frames by handcrafting features from each video frame and tracking them based on displacement information from an optical flow field, and B) machine learning approaches without considering complex long-range temporal dependencies by extracting features using convolutional neural networks (CNNs) or recurrent neural networks (RNNs), and then using standard classifiers or RNNs for action prediction without attention or with only between-frame attention.

SUMMARY

According to an aspect of the present principles, a video monitoring system is provided. The video monitoring system includes a camera. The camera is positioned to monitor an area and capture live video to provide a live video stream. The video monitoring system further includes a security processing system. The security processing system includes a processor and memory coupled to the processor. The security processing system is programmed to detect and identify a target action sequence in the live video stream using a multi-layer deep long short-term memory process on an attention factor that is based on a within-frame attention and a between-frame attention. The security processing system is further programmed to trigger an action to alert that a target action sequence has been detected.

According to another aspect of the present principles, a computer-implemented method is provided for home security. The method includes monitoring an area with a camera. The method further includes capturing, by the camera, live video to provide a live video stream. The method also includes detecting and identifying, by a processor, a target action sequence in the live video stream using a multi-layer deep long short-term memory process on an attention factor that is based on a within-frame attention and a between-frame attention. The method additionally includes triggering, by the processor, an action to alert that a target action sequence has been detected.

These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:

FIG. 1 shows a block diagram of an exemplary processing system to which the present invention may be applied, in accordance with an embodiment of the present invention;

FIG. 2 shows a block diagram of an exemplary environment to which the present invention can be applied, in accordance with an embodiment of the present invention;

FIG. 3 shows a high-level block/flow diagram of an exemplary high-order convolutional neural network method, in accordance with an embodiment of the present invention;

FIG. 4 is a flow diagram illustrating a method for video-based action recognition, in accordance with an embodiment of the present invention;

FIG. 5 shows a high-level block/flow diagram of a deep 3D attention recurrent neural network method, in accordance with an embodiment of the present invention;

FIG. 6 shows a block/flow diagram of a deep 3D attention recurrent neural network method, in accordance with an embodiment of the present invention;

FIG. 7 shows a block/flow diagram of a video monitoring system, in accordance with an embodiment of the present invention; and

FIG. 8 is a flow diagram illustrating a method for video monitoring, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

A system using Deep 3D attention Long Short-Term Memory for video-based action recognition is presented. Unlike previous approaches, this system is capable of capturing long-range complex temporal dependencies in long video sequences with both between-frame and within-frame attention. This system uses a novel objective function enabling users to easily identify key video segments for target actions. Target actions may include an intruder entering a restricted area, a confined animal escaping an enclosure, or a piece of machinery malfunctioning and endangering people or property in the machinery's vicinity, etc. It is to be understood that the target actions listed and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention.

FIG. 1 shows a block diagram of an exemplary processing system 100 to which the present principles may be applied, in accordance with an embodiment of the present invention. The processing system 100 includes at least one processor (CPU) 104 operatively coupled to other components via a system bus 102. A cache 106, a Read Only Memory (ROM) 108, a Random Access Memory (RAM) 110, an input/output (I/O) adapter 120, a sound adapter 130, a network adapter 140, a user interface adapter 150, and a display adapter 160 are operatively coupled to the system bus 102.

A first storage device 122 and a second storage device 124 are operatively coupled to system bus 102 by the I/O adapter 120. The storage devices 122 and 124 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. The storage devices 122 and 124 can be the same type of storage device or different types of storage devices.

A speaker 132 is operatively coupled to system bus 102 by the sound adapter 130. The speaker 132 can be used to provide an audible alarm or some other indication relating to the present invention. A transceiver 142 is operatively coupled to system bus 102 by network adapter 140. A display device 162 is operatively coupled to system bus 102 by display adapter 160.

A first user input device 152, a second user input device 154, and a third user input device 156 are operatively coupled to system bus 102 by user interface adapter 150. The user input devices 152, 154, and 156 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present invention. The user input devices 152, 154, and 156 can be the same type of user input device or different types of user input devices. The user input devices 152, 154, and 156 are used to input and output information to and from system 100.

Of course, the processing system 100 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 100, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 100 are readily contemplated by one of ordinary skill in the art given the teachings of the present invention provided herein.

Moreover, it is to be appreciated that environment 200 described below with respect to FIG. 2 is an environment for implementing respective embodiments of the present invention. Part or all of processing system 100 may be implemented in one or more of the elements of environment 200.

Further, it is to be appreciated that processing system 100 may perform at least part of the method described herein including, for example, at least part of method 300 of FIG. 3 and/or at least part of method 400 of FIG. 4 and/or at least part of method 500 of FIG. 5 and/or at least part of method 600 of FIG. 6 and/or at least part of method 800 of FIG. 8. Similarly, part or all of system 200 may be used to perform at least part of method 300 of FIG. 3 and/or at least part of method 400 of FIG. 4 and/or at least part of method 500 of FIG. 5 and/or at least part of method 600 of FIG. 6 and/or at least part of method 800 of FIG. 8.

FIG. 2 shows an exemplary environment 200 to which the present invention can be applied, in accordance with an embodiment of the present invention. The environment 200 is representative of a computer network to which the present invention can be applied. The elements shown relative to FIG. 2 are set forth for the sake of illustration. However, it is to be appreciated that the present invention can be applied to other network configurations as readily contemplated by one of ordinary skill in the art given the teachings of the present invention provided herein, while maintaining the spirit of the present invention.

The environment 200 at least includes a set of computer processing systems 210. The computer processing systems 210 can be any type of computer processing system including, but not limited to, servers, desktops, laptops, tablets, smart phones, media playback devices, and so forth. For the sake of illustration, the computer processing systems 210 include server 210A, server 210B, and server 210C.

In an embodiment, the present invention performs a deep 3D attention recurrent neural network method for any of the computer processing systems 210. Thus, any of the computer processing systems 210 can perform video analysis that can be stored in, or accessed by, any of the computer processing systems 210. Moreover, the output (including active video segments) of the present invention can be used to control other systems and/or devices and/or operations and/or so forth, as readily appreciated by one of ordinary skill in the art given the teachings of the present invention provided herein, while maintaining the spirit of the present invention.

In the embodiment shown in FIG. 2, the elements thereof are interconnected by a network(s) 201. However, in other embodiments, other types of connections can also be used. Additionally, one or more elements in FIG. 2 may be implemented by a variety of devices, which include but are not limited to, Digital Signal Processing (DSP) circuits, programmable processors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and so forth. These and other variations of the elements of environment 200 are readily determined by one of ordinary skill in the art, given the teachings of the present invention provided herein, while maintaining the spirit of the present invention.

FIG. 3 shows a high-level block/flow diagram of an exemplary high-order convolutional neural network method 300, in accordance with an embodiment of the present invention.

At step 310, receive an input image 311.

At step 320, perform convolutions on the input image 311 to obtain high-order feature maps 321.

At step 330, perform sub-sampling on the high-order feature maps 321 to obtain a set of maps 331.

At step 340, perform convolutions on the set of maps 331 to obtain another set of maps 341.

At step 350, perform sub-sampling on the other set of maps 341 to obtain yet another set of maps 351 that form a fully connected layer 352. The fully connected layer 352 provides a feature vector 352A.

It is to be appreciated that the neurons in the fully connected layer 352 have full connections to all activations in the previous layer. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
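By way of a non-limiting illustration, the computation of the fully connected layer 352 as a matrix multiplication followed by a bias offset may be sketched as follows. The function name, the array shapes, and the random inputs are hypothetical and are chosen only for the example; they are not taken from the figures.

```python
import numpy as np

def fully_connected(activations, weights, bias):
    """Dense layer: each output neuron is connected to every input activation.

    activations: (n_in,)       flattened activations of the previous layer
    weights:     (n_out, n_in) one row of weights per output neuron
    bias:        (n_out,)      bias offset added after the multiplication
    """
    return weights @ activations + bias

# Hypothetical sizes: 12 flattened map values in, 4-dimensional feature vector out.
rng = np.random.default_rng(0)
flat_maps = rng.standard_normal(12)
W = rng.standard_normal((4, 12))
b = np.zeros(4)
feature_vector = fully_connected(flat_maps, W, b)
```

In this sketch the result plays the role of the feature vector 352A; any nonlinearity applied afterwards is omitted.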

Optionally, depending on the task, more fully connected layers can be used in addition to the fully connected layer 352, and the convolution and sub-sampling of steps 320 and 330 can be repeated more times than the single repetition shown as steps 340 and 350.

It is to be further appreciated that while a single image is mentioned with respect to step 310, multiple images such as in the case of one or more video sequences can be input and processed in accordance with the method 300 of FIG. 3, while maintaining the spirit of the present invention.

Referring to FIG. 4, a flow chart for a video-based action recognition method 400 is illustratively shown, in accordance with an embodiment of the present invention. In block 410, receive one or more frames from one or more video sequences. In block 420, generate, using a deep convolutional neural network, a feature vector for each patch of the one or more frames. In block 430, generate an attention factor for the feature vectors based on a within-frame attention and a between-frame attention. In block 440, identify a target action using a multi-layer deep long short-term memory process applied to the attention factor. The target action represents at least one of the one or more video sequences. In block 450, control an operation of a processor-based machine to change a state of the processor-based machine, responsive to the at least one of the one or more video sequences including the identified target action.

Deep 3D attention Long Short-Term Memory (LSTM) may contain multiple modules. In one embodiment, the Deep 3D attention LSTM may include an input module. The input module may be a deep convolutional neural network (CNN). For each time frame at time point t, the output of the last convolutional layer is utilized, which contains K patches and each patch is a D dimensional feature vector. The output of this module is a set of features $x_i^t \in \mathbb{R}^D$, where $t \in \{1, \ldots, T\}$ is the time point index of the frame and $i \in \{1, \ldots, K\}$ is the index of the patch. The convolution patch size is a learnable non-fixed parameter.

In another embodiment, the Deep 3D attention LSTM may include an attention module. The attention module may contain within-frame attention and between-frame attention, and each could be either a hard or a soft attention. Hard attention assesses certain aspects of the frame one feature at a time and aggregates the information. Soft attention assesses the frame by concentrating on certain key features based on all the features. The within-frame soft attention weight $\alpha_i^t$ for patch i of frame t is given by:

$\alpha_i^t = \mathrm{softmax}(w_i^{\top} x_i^t),$

where $w_i \in \mathbb{R}^D$, $i \in \{1, \ldots, K\}$, are learnable parameters, $x_i^t$ is the feature representation of patch i of frame t generated by the deep CNN, and $\mathrm{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. Applying the within-frame level attention gives the between-frame level attention's input:

$x^t = \sum_{i=1}^{K} \alpha_i^t x_i^t.$
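As a minimal sketch only, the within-frame soft attention for a single frame may be computed as shown below. It is assumed here that the patch features are stacked into a K-by-D array and that the learnable vectors w_i are stacked the same way; the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def within_frame_attention(patches, w):
    """Within-frame soft attention for one frame t.

    patches: (K, D) array, row i is the patch feature x_i^t
    w:       (K, D) array, row i is the learnable vector w_i
    Returns the weights alpha^t of shape (K,) and the attended frame
    feature x^t of shape (D,).
    """
    scores = np.einsum("kd,kd->k", w, patches)  # w_i^T x_i^t for each patch i
    alpha = softmax(scores)                     # normalize over the K patches
    x_t = alpha @ patches                       # sum_i alpha_i^t x_i^t
    return alpha, x_t
```

With all w_i set to zero, every patch receives the same weight 1/K and x^t reduces to the plain average of the patch features, which is a convenient sanity check.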

Other options for within-frame level attention could be a multilayer perceptron (MLP) followed by a softmax layer. For between-frame soft attention, we use bidirectional LSTMs with:

$\overrightarrow{h^t}, \overrightarrow{c^t} = \mathrm{LSTM}_{fwd}(x^t, \overrightarrow{h^{t-1}}, \overrightarrow{c^{t-1}}),$

$\overleftarrow{h^t}, \overleftarrow{c^t} = \mathrm{LSTM}_{bwd}(x^t, \overleftarrow{h^{t+1}}, \overleftarrow{c^{t+1}}),$

$h^t = \overrightarrow{h^t} + \overleftarrow{h^t},$

where $x^t$ is the output of the within-frame attention at time point t, $\overrightarrow{h^t}, \overrightarrow{c^t}$ are the hidden state and the cell state of the forward LSTM at time point t, $\overleftarrow{h^t}, \overleftarrow{c^t}$ are the hidden state and the cell state of the backward LSTM at time point t, and $h^t$ is the final hidden state, which contains information from both the future and the past. Given the bandwidth L (i.e., a free parameter) of between-frame attention, the between-frame attention could be calculated with:

${\beta^{t} = \frac{a^{t^{T}}h^{t}}{\Sigma_{j = {t - L}}^{t + L}a^{j^{T}}h^{j}}},$

where α^(t) ∈

are learnable parameters and M is the hidden state dimension in LSTM.The final 3D attention module output, or attention factor, at time pointt is;

s ^(t)=Σ_(j=t−L) ^(t+L)β^(j)Σ_(i=1) ^(K)α_(i) ^(j) x _(i) ^(j).
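For illustration only, the between-frame attention and the resulting attention factor may be sketched as follows, taking the bidirectional LSTM hidden states as given. The text does not state how the window from t−L to t+L is handled at the first and last frames; this sketch assumes the window is simply truncated to the frames that exist. The function and variable names are hypothetical.

```python
import numpy as np

def between_frame_attention(h, a, x, L):
    """Between-frame attention over a window of bandwidth L.

    h: (T, M) final hidden states h^t of the bidirectional LSTM
    a: (T, M) learnable vectors a^t
    x: (T, D) within-frame attention outputs x^t
    Returns beta of shape (T,) and the attention factor s of shape (T, D).
    Assumption: windows are truncated at the sequence boundaries.
    """
    T = h.shape[0]
    scores = np.einsum("tm,tm->t", a, h)  # a^{t T} h^t for each frame t
    beta = np.zeros(T)
    for t in range(T):
        lo, hi = max(0, t - L), min(T, t + L + 1)
        # Plain ratio as in the formula; no exponential is applied here.
        beta[t] = scores[t] / scores[lo:hi].sum()
    s = np.zeros((T, x.shape[1]))
    for t in range(T):
        lo, hi = max(0, t - L), min(T, t + L + 1)
        s[t] = beta[lo:hi] @ x[lo:hi]     # sum_j beta^j x^j over the window
    return beta, s
```

Because the weight is a ratio of raw scores, a practical implementation would likely need to guard against a zero or negative denominator; that safeguard is left out to keep the sketch close to the formula.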

In yet another embodiment, the Deep 3D attention LSTM may include an output module. The output module may apply a multi-layer deep LSTM to produce $q^t \in \mathbb{R}^C$, where C is the number of action classes. The final output is:

$\hat{y}_c^t = \mathrm{softmax}\left(\frac{1}{2L+1} \sum_{j=t-L}^{t+L} q_c^j\right).$
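A minimal sketch of this final output step is given below, assuming the per-frame outputs q^t of the deep LSTM are already available as a T-by-C array. At the sequence boundaries, where fewer than 2L+1 frames exist, the sketch averages over the available frames only; this boundary rule is an assumption and is not specified in the text.

```python
import numpy as np

def windowed_class_probs(q, L):
    """Average q over a window of bandwidth L, then apply softmax over classes.

    q: (T, C) per-frame outputs of the multi-layer deep LSTM
    Returns y_hat of shape (T, C); each row sums to 1.
    Assumption: at the edges the mean is taken over the frames that exist.
    """
    T, C = q.shape
    y_hat = np.zeros((T, C))
    for t in range(T):
        lo, hi = max(0, t - L), min(T, t + L + 1)
        avg = q[lo:hi].mean(axis=0)        # (1 / window size) * sum_j q^j
        e = np.exp(avg - avg.max())        # stable softmax over the C classes
        y_hat[t] = e / e.sum()
    return y_hat
```

Averaging over neighboring frames smooths the per-frame scores, so a single noisy frame is less likely to change the predicted action class.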

In still another embodiment, the Deep 3D attention LSTM may include a domain knowledge module. The domain knowledge module may be achieved by embedding a target or additional knowledge followed by a dot product with the output of the input module.
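One possible reading of the domain knowledge module is sketched below, purely as an assumption: a knowledge item is looked up in a learnable embedding table and the resulting vector is combined by dot product with each patch feature from the input module. The embedding table, the integer identifier, and the function name are hypothetical and do not appear in the description.

```python
import numpy as np

def domain_knowledge_scores(patches, embedding_table, knowledge_id):
    """Dot product between an embedded knowledge item and each patch feature.

    patches:         (K, D) patch features x_i^t from the input module
    embedding_table: (V, D) hypothetical learnable embeddings, one row per item
    knowledge_id:    index of the target or additional knowledge item
    Returns one relevance score per patch, shape (K,).
    """
    e = embedding_table[knowledge_id]  # (D,) embedded knowledge vector
    return patches @ e                 # e . x_i^t for each patch i
```

How such scores would then enter the 3D attention process is not detailed in the text and is left open in this sketch.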

The cross-entropy loss function has three choices (N is the number of samples and $y_c$ is the ground-truth indicator for class c), and the training is performed by back-propagation:

To use the last time point:

$\mathcal{L} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_c \log \hat{y}_c^T;$

To use all time points:

$\mathcal{L} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{c=1}^{C} y_c \log \hat{y}_c^t;$

To use the maximum probability's time point (max-neighbor):

$\mathcal{L} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_c \log\left(\max_{t=1}^{T} \hat{y}_c^t\right).$
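By way of example, the three choices may be evaluated for a single sample as sketched below; summing the result over the N samples gives the full loss. The small constant eps is an added assumption used only to avoid taking the logarithm of zero, and the function name is hypothetical.

```python
import numpy as np

def action_losses(y, y_hat, eps=1e-12):
    """Three cross-entropy variants for one sample.

    y:     (C,)   one-hot ground-truth label y_c
    y_hat: (T, C) predicted class probabilities at each time point
    Returns (last_time_point, all_time_points, max_neighbor).
    """
    log_p = np.log(y_hat + eps)
    last = -np.sum(y * log_p[-1])                              # uses t = T only
    all_t = -np.sum(y * log_p)                                 # sums over every t
    max_nb = -np.sum(y * np.log(y_hat.max(axis=0) + eps))      # best t per class
    return last, all_t, max_nb

# Worked example: 3 time points, 2 classes, true class is index 0.
y = np.array([1.0, 0.0])
y_hat = np.array([[0.5, 0.5], [0.8, 0.2], [0.25, 0.75]])
last, all_t, max_nb = action_losses(y, y_hat)
```

In the worked example the true class is most probable at the middle time point, so the max-neighbor loss equals −log 0.8, while the last-time-point loss equals −log 0.25. This illustrates how the max-neighbor choice rewards a confident detection at any point in the sequence.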

FIG. 5 shows a high-level block/flow diagram of a deep 3D attention recurrent neural network method 500, in accordance with an embodiment of the present invention. The deep 3D attention recurrent neural network method 500 may include a video 510 (with one embodiment used in step 610 in FIG. 6) to supply the video frames analyzed in the deep 3D attention recurrent neural network method 500. The video 510 may be fed into an adaptive patch size convolution network 520 (with one embodiment used in step 620 in FIG. 6) to produce vectors representing the frames of the video. In one embodiment, the adaptive patch size convolution network 520 may function as the input module as described above in the Deep 3D attention LSTM.

The deep 3D attention recurrent neural network method 500 may include a domain knowledge process 540. The domain knowledge process 540 may embed additional knowledge with a dot product of the vectors produced by the adaptive patch size convolution network 520. In one embodiment, the domain knowledge process 540 may function as the domain knowledge module as described above in the Deep 3D attention LSTM.

The deep 3D attention recurrent neural network method 500 may include a 3D attention process 530 (with one embodiment used in steps 630 and 640 in FIG. 6). In one embodiment, the 3D attention process 530 may take the vectors from the adaptive patch size convolution network 520 to produce final 3D attention values. In another embodiment, the 3D attention process 530 may take the vectors from the adaptive patch size convolution network 520 and the additional knowledge embedded by the domain knowledge process 540 to produce final 3D attention values. In yet another embodiment, the 3D attention process 530 may function as the attention module as described above in the Deep 3D attention LSTM.

The deep 3D attention recurrent neural network method 500 may include a cross entropy with max-neighbor process 550 (with one embodiment used in step 650 in FIG. 6). In one embodiment, the cross entropy with max-neighbor process 550 may apply a deep LSTM to the final 3D attention values from the 3D attention process 530 to produce the final output. In another embodiment, the cross entropy with max-neighbor process 550 may utilize a cross-entropy loss function as described above. In yet another embodiment, the cross entropy with max-neighbor process 550 may function as the output module as described above in the Deep 3D attention LSTM.

The deep 3D attention recurrent neural network method 500 may include an action category 560 (with one embodiment used in step 660 in FIG. 6). The action category 560 represents the action the deep 3D attention recurrent neural network method 500 detected from the video 510.

FIG. 6 shows a block/flow diagram of a deep 3D attention recurrent neural network method 600, in accordance with an embodiment of the present invention.

At step 610, receive video frames 612 over time 611.

At step 620, perform convolutions 621 on the video frames 612 to obtain a set of features 622 and a set of learnable parameters 623.

At step 630, perform softmax 631 on the set of features 622 and the set of learnable parameters 623 to obtain the within-frame level attention input 632.

At step 640, perform bidirectional LSTM 641 and softmax 642 on the within-frame level attention input 632 to obtain the 3D attention output 643.

At step 650, perform a deep LSTM 651 on the 3D attention output 643 to obtain the RNN output 652.

At step 660, pass the RNN output 652 into the action category 661.

The invention as described may be used in many different embodiments. One useful embodiment may have the invention in a video monitoring system. FIG. 7 shows a block/flow diagram of a video monitoring system 700, in accordance with an embodiment of the present invention. The video monitoring system 700 may include a security processing system 710. The security processing system 710 may include a processing system 100 from FIG. 1. The security processing system 710 may be equipped with computing functions and control. The security processing system 710 may include one or more processors 711 (hereafter “processor”). The security processing system 710 may include a memory storage 712. The memory storage 712 may include solid state or soft storage and work in conjunction with other devices of the video monitoring system 700 to record data, run algorithms or programs, store safety procedures, a deep 3D attention recurrent neural network, etc. The memory storage 712 may include a Read Only Memory (ROM), random access memory (RAM), or any other type of memory useful for the present applications.

The security processing system 710 may include a communication array 716 to handle communications between the different devices in the video monitoring system 700. In one embodiment, the communication array 716 may be equipped to communicate with a cellular network system. In this way, the security processing system 710 may contact a control center with information related to the status of the video monitoring system 700 and the property the system is securing. The communication array 716 may include a WIFI or equivalent radio system, a local area network (LAN), hardwired system, etc. The communication array 716 may provide the security processing system 710 a communication channel 760 with other devices in the video monitoring system 700.

The security processing system 710 may include a power source 715. The power source 715 may include or employ one or more batteries, a generator with liquid fuel (e.g., gasoline, alcohol, diesel, etc.) or other energy source. In another embodiment, the power source 715 may include one or more solar cells or one or more fuel cells. In another embodiment, the power source 715 may include power from the building with the video monitoring system 700. The security processing system 710 may have multiple sources in the power source 715. In one embodiment, the security processing system 710 may include power directly from the building and a battery system as a back-up to ensure the video monitoring system 700 stays active if a power interruption occurs.

The security processing system 710 may include a security light 713. The security light 713 may be illuminated when the security processing system 710 detects an intruder in the area of the security light 713 to deter the intruder or give investigators improved visibility in the area of the security light 713. The security processing system 710 may include a speaker 714. The speaker 714 may act as an alarm when the security processing system 710 detects an intruder in a secure area to deter the intruder or notify investigators of an intruder.
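As a hedged sketch only, the triggering of alert actions such as the security light 713 and the speaker 714 may be organized as a list of callbacks that fire when the probability of the target action exceeds a threshold. The class name, the threshold value of 0.9, and the event labels are hypothetical; the description does not specify a decision rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

@dataclass
class AlertDispatcher:
    """Runs every registered alert action when the target action is detected."""
    target_class: int
    threshold: float = 0.9  # hypothetical confidence threshold
    actions: List[Callable[[int], None]] = field(default_factory=list)

    def on_frame(self, t: int, class_probs: Sequence[float]) -> bool:
        """Check one frame's class probabilities; return True if alerts fired."""
        if class_probs[self.target_class] >= self.threshold:
            for action in self.actions:
                action(t)
            return True
        return False

# Stand-ins for hardware drivers: each action only records that it was called.
events = []
dispatcher = AlertDispatcher(
    target_class=1,
    actions=[
        lambda t: events.append(("light_713_on", t)),
        lambda t: events.append(("speaker_714_alarm", t)),
    ],
)
dispatcher.on_frame(0, [0.7, 0.3])    # below threshold: nothing happens
dispatcher.on_frame(1, [0.05, 0.95])  # above threshold: both actions run
```

Keeping the actions as interchangeable callbacks would let other responses, such as contacting a control center through the communication array 716, be added without changing the detection logic.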

Of course, the security processing system 710 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in the security processing system 710, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, displays, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the security processing system 710 are readily contemplated by one of ordinary skill in the art given the teachings of the present invention provided herein.

The video monitoring system 700 may include a camera 720. The camera 720 may communicate through the communication channel 760 to the security processing system 710. The camera 720 may include a power source 722. The power source 722 may include or employ one or more batteries or other energy source. In another embodiment, the power source 722 may include one or more solar cells or one or more fuel cells. In another embodiment, the power source 722 may include power from the building with the video monitoring system 700. In yet another embodiment, the power source 722 may include power through the communication channel 760 linking the camera 720 to the security processing system 710. The camera 720 may have multiple sources in the power source 722. In one embodiment, the camera 720 may include power through the communication channel 760 and a battery system as a back-up to ensure the camera 720 stays active if a power interruption occurs.

The camera 720 may include a communication array 724 to handle communications between the camera 720 and the security processing system 710. In one embodiment, the communication array 724 may be equipped to communicate with a cellular network system. The communication array 724 may include a WIFI or equivalent radio system, a local area network (LAN), hardwired system, etc. The communication array 724 may connect the camera 720 to the security processing system 710 through the communication channel 760.

The camera 720 may include one or more motors 726. The motor 726 may physically move the camera 720, so that the area covered by the camera 720 is greater than the fixed field of view of the camera 720. The motor 726 may be used to zoom a lens in the camera 720 to get a zoomed-in image of the area being covered by the camera 720. The motor 726 may be controlled by commands originating in the camera 720 or by commands originating in the security processing system 710.

Of course, the camera 720 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other lenses or lights for night vision or infrared detection may be included in the camera 720, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art.

The video monitoring system 700 may include an electronic lock 730. The electronic lock 730 may communicate through the communication channel 760 to the security processing system 710. The electronic lock 730 may include a power source 736. The power source 736 may include or employ one or more batteries or other energy source. In another embodiment, the power source 736 may include one or more solar cells or one or more fuel cells. In another embodiment, the power source 736 may include power from the building with the video monitoring system 700. In yet another embodiment, the power source 736 may include power through the communication channel 760 linking the electronic lock 730 to the security processing system 710. The electronic lock 730 may have multiple sources in the power source 736. In one embodiment, the electronic lock 730 may include power through the communication channel 760 and a battery system as a back-up to ensure the electronic lock 730 stays active if a power interruption occurs.

The electronic lock 730 may include a communication array 738 to handle communications between the electronic lock 730 and the security processing system 710. In one embodiment, the communication array 738 may be equipped to communicate with a cellular network system. The communication array 738 may include a WIFI or equivalent radio system, a local area network (LAN), hardwired system, etc. The communication array 738 may connect the electronic lock 730 to the security processing system 710 through the communication channel 760.

The electronic lock 730 may include a motor 734. The motor 734 may physically actuate a bolt in the electronic lock 730. In one embodiment, the motor 734 actuates one or more bolts along a door to lock the door. In another embodiment, the motor 734 may actuate a hook in a window to lock the window. The motor 734 may be controlled by commands originating in the electronic lock 730 or by commands originating in the security processing system 710.

The electronic lock 730 may include a solenoid 732. The solenoid 732 may physically actuate a bolt in the electronic lock 730. In one embodiment, the solenoid 732 actuates one or more bolts along a door to lock the door. In another embodiment, the solenoid 732 may actuate a hook in a window to lock the window. The solenoid 732 may be controlled by commands originating in the electronic lock 730 or by commands originating in the security processing system 710.

Of course, the electronic lock 730 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other engaging mechanisms may be included in the electronic lock 730, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art.

The video monitoring system 700 may include an input console 740. Theinput console 740 may communicate through the communication channel 760to the security processing system 710. The input console 740 may includea power source 748. The power source 748 may include or employ one ormore batteries or other energy source. In another embodiment, the powersource 748 may include one or more solar cells or one or more fuelcells. In another embodiment, the power source 748 may include powerfrom the building with the video monitoring system 700. In yet anotherembodiment, the power source 748 may include power through thecommunication channel 760 linking the input console 740 to the securityprocessing system 710. The input console 740 may have multiple sourcesin the power source 748. In one embodiment, the input console 740 mayinclude power through the communication channel 760 and a battery systemas a back-up to ensure the input console 740 stays active if a powerinterruption occurs.

The input console 740 may have one or more input devices 741. The input devices 741 may include a keypad 742, a retinal scanner 744, or a fingerprint reader 746. The input console 740 may include more than one of the input devices 741. In one embodiment, the input console 740 may include a keypad 742 and a fingerprint reader 746 to support two-factor authentication. In one embodiment, the input console 740 may include a keypad 742, a retinal scanner 744, and a fingerprint reader 746 to support three-factor authentication.
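The two-factor and three-factor configurations above can be illustrated with a minimal sketch in which a user is authenticated only when a required number of distinct input devices 741 have each verified the user. The names `InputDevice` and `is_authenticated` and the parameter `required_factors` are hypothetical and are used for illustration only; how each device verifies the user is not modeled.

```python
from enum import Enum


class InputDevice(Enum):
    """Input devices 741 named in the description (labels are illustrative)."""
    KEYPAD = "keypad 742"
    RETINAL_SCANNER = "retinal scanner 744"
    FINGERPRINT_READER = "fingerprint reader 746"


def is_authenticated(verified, required_factors=2):
    """Return True when enough distinct input devices verified the user.

    `verified` is an iterable of InputDevice values that each reported a
    successful check. Repeated checks on the same device count once.
    """
    if required_factors < 1:
        raise ValueError("required_factors must be at least 1")
    return len(set(verified)) >= required_factors
```

Counting distinct devices, rather than successful checks, keeps a repeated keypad entry from being treated as a second factor.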

The input console 740 may include a communication array 749 to handle communications between the input console 740 and the security processing system 710. In one embodiment, the communication array 749 may be equipped to communicate with a cellular network system. The communication array 749 may include a WIFI or equivalent radio system, a local area network (LAN), hardwired system, etc. The communication array 749 may connect the input console 740 to the security processing system 710 through the communication channel 760.

Of course, the input console 740 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices may be included in the input console 740, such as a camera for facial recognition, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art.

The video monitoring system 700 may include one or more sensors 750 (hereafter “sensor”). The sensor 750 may communicate through the communication channel 760 to the security processing system 710. The sensor 750 may include a power source 756. The power source 756 may include or employ one or more batteries or another energy source. In another embodiment, the power source 756 may include one or more solar cells or one or more fuel cells. In another embodiment, the power source 756 may include power from the building with the video monitoring system 700. In yet another embodiment, the power source 756 may include power through the communication channel 760 linking the sensor 750 to the security processing system 710. The sensor 750 may have multiple sources in the power source 756. In one embodiment, the sensor 750 may include power through the communication channel 760 and a battery system as a back-up to ensure the sensor 750 stays active if a power interruption occurs.

The sensor 750 may have one or more sensor types 751. The sensor types 751 may include audio 752 or contact 754. The sensor 750 may include more than one of the sensor types 751. In one embodiment, the sensor 750 may include an audio 752 and a contact 754. This embodiment may secure a window by detecting, with the contact 754, when the window is closed and by detecting, with the audio 752, if the window is broken.
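As a minimal sketch of the combined audio and contact embodiment, the two readings can be reduced to a single window status. The function name `window_status`, its boolean parameters, and the status strings are hypothetical, and the sketch assumes that a detected break takes priority over the contact reading.

```python
def window_status(contact_closed, glass_break_heard):
    """Combine a contact reading and an audio reading into one status.

    contact_closed: True when the contact sensor reports the window closed.
    glass_break_heard: True when the audio sensor reports breaking glass.
    A break is reported even if the contact still reads closed (assumption).
    """
    if glass_break_heard:
        return "broken"
    return "closed" if contact_closed else "open"
```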

The sensor 750 may include a communication array 758 to handle communications between the sensor 750 and the security processing system 710. In one embodiment, the communication array 758 may be equipped to communicate with a cellular network system. The communication array 758 may include a WIFI or equivalent radio system, a local area network (LAN), hardwired system, etc. The communication array 758 may connect the sensor 750 to the security processing system 710 through the communication channel 760.

Of course, the sensor 750 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other types of sensors may be included in the sensor 750, such as a temperature sensor for detecting body heat, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art.

The security processing system 710 may take video from the camera 720 to monitor the area being secured by the video monitoring system 700. The security processing system 710 may recognize action in the video that is outside normal criteria. This action may include an intruder running up to the premises or a projectile approaching the premises. In one embodiment, the security processing system 710 may actuate the electronic locks 730 on the premises to secure the premises while sounding an alarm over the speaker 714 and turning on the security light 713. The security processing system 710 may also clip the video of the action sequence and send it to a security monitoring station, the homeowner, or both to have evidence of the intrusion. In another embodiment, the security processing system 710 may actuate the motor 734 in the electronic lock 730 to close and lock windows when the action recognized is rain. Many other actions can be recognized with the present system, with different actions having different responses. In one embodiment, the security processing system 710 may use the electronic lock 730 to secure a pet door when the video shows a raccoon approaching the pet door.
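The pairing of recognized actions with different responses may be illustrated with a simple lookup table. This is a minimal sketch only: the action labels, the response names, and the function `plan_response` are hypothetical, and the table merely restates the three examples given above.

```python
# Hypothetical labels for recognized actions, mapped to illustrative responses.
RESPONSES = {
    "human_intrusion": [
        "actuate_electronic_locks",
        "sound_alarm_on_speaker",
        "turn_on_security_light",
        "send_video_clip",
    ],
    "rain": ["close_and_lock_windows"],
    "raccoon_at_pet_door": ["secure_pet_door"],
}


def plan_response(action_label):
    """Return the list of responses for a recognized action.

    An unrecognized label yields an empty list, so no response is triggered.
    A copy is returned so callers cannot modify the table.
    """
    return list(RESPONSES.get(action_label, []))
```

A table of this kind keeps the recognition step separate from the response step, so that responses for additional actions can be added without changing the recognizer.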

Moreover, it is to be appreciated that video monitoring system 700 may perform at least part of the method described herein including, for example, at least part of method 300 of FIG. 3 and/or at least part of method 400 of FIG. 4 and/or at least part of method 500 of FIG. 5 and/or at least part of method 600 of FIG. 6 and/or at least part of method 800 of FIG. 8.

Referring to FIG. 8, a flow chart for a video monitoring method 800 is illustratively shown, in accordance with an embodiment of the present invention. In block 810, monitor an area with a camera. In block 820, capture, by the camera, live video to provide a live video stream. In block 830, detect and identify a target action sequence in the live video stream using a multi-layer deep long short-term memory process on an attention factor that is based on a within-frame attention and a between-frame attention. In block 840, trigger an action to alert that a target action sequence has been detected.
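One plausible reading of an attention factor based on a within-frame attention and a between-frame attention can be sketched with soft attention in plain Python. The sketch is illustrative only and rests on stated assumptions: the relevance scores are supplied as inputs rather than learned, the names `softmax`, `attend`, and `attention_factor` are hypothetical, and the multi-layer deep long short-term memory process that would consume the result is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, scores):
    """Soft attention: softmax-weighted sum of equal-length vectors."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def attention_factor(frames, region_scores, frame_scores):
    """Combine within-frame and between-frame soft attention.

    frames[t][r] is the feature vector of region r in frame t,
    region_scores[t][r] is a relevance score for that region, and
    frame_scores[t] is a relevance score for frame t. All scores are
    placeholders for values a trained model would produce (assumption).
    """
    # Within-frame attention: one summary vector per frame.
    per_frame = [attend(regions, scores)
                 for regions, scores in zip(frames, region_scores)]
    # Between-frame attention: one summary vector for the sequence.
    return attend(per_frame, frame_scores)
```

With equal scores, each attention step reduces to a plain average, while a large score on one region or frame makes the result approach that region's or frame's features.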

Embodiments described herein may be entirely hardware, entirely software, or may include both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.

Each computer program may be tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.

Reference in the specification to “one embodiment” or “an embodiment” of the present invention, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.

It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
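The selections enumerated above are the non-empty subsets of the listed options, so a list of n options yields 2^n − 1 selections. The following minimal sketch generates them with the Python standard library; the function name `and_or_selections` is hypothetical and is used for illustration only.

```python
from itertools import combinations


def and_or_selections(options):
    """Return every non-empty selection of the listed options.

    Selections are ordered by size, then by listing order, which matches
    the order used in the paragraph above.
    """
    return [
        combo
        for size in range(1, len(options) + 1)
        for combo in combinations(options, size)
    ]
```

For the options A and B, this yields the three selections (A), (B), and (A, B); for A, B, and C, it yields the seven selections listed above.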

The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

What is claimed is:
 1. A video monitoring system comprising: a camera positioned to monitor an area and capture live video to provide a live video stream; a security processing system including a processor and memory coupled to the processor, the processing system programmed to: detect and identify a target action sequence in the live video stream using a multi-layer deep long short-term memory process on an attention factor that is based on a within-frame attention and a between-frame attention; and trigger an action to alert that a target action sequence has been detected.
 2. The system of claim 1, further comprising one or more sensors capable of detecting a change of state.
 3. The system of claim 2, wherein the one or more sensors include a sensor selected from the group consisting of a temperature sensor, a contact sensor, and an audio sensor.
 4. The system of claim 1, further comprising a speaker that sounds an alarm when receiving the action from the security controller.
 5. The system of claim 1, wherein the processing system is further programmed to recognize targeted action sequences when the video monitoring system is in both an activated state and a deactivated state.
 6. The system of claim 1, wherein the processing system is further programmed to record a video clip of the live video stream when the targeted action sequence is identified.
 7. The system of claim 6, wherein the processing system is further programmed to send the video clip offsite to a user or a security monitoring station.
 8. The system of claim 6, wherein a user selects the targeted action sequence from one or more targeted action sequences, wherein the one or more targeted action sequences include an action sequence selected from the group consisting of a human intrusion, an animal intrusion, or a rain intrusion.
 9. The system of claim 1, further comprising an electronic lock capable of changing a lock state responsive to receiving the action from the processing system.
 10. The system of claim 9, wherein the electronic lock can both close and secure a door connected to the electronic lock.
 11. The system of claim 1, further comprising an input console to transmit an activation command to the processing system when the activation command is entered by a user or a deactivation command to the processing system when the deactivation command is entered by a user.
 12. The system of claim 11, wherein the input console includes an input device selected from the group consisting of a keypad, a retinal scanner, and a fingerprint reader.
 13. The system of claim 11, wherein the deactivation command requires two-factor authentication of the user.
 14. The system of claim 1, wherein the within-frame attention and the between-frame attention use at least one of a softmax layer and a bidirectional long short-term memory process.
 15. The system of claim 1, wherein the within-frame attention and the between-frame attention include an attention selected from the group consisting of a hard attention and a soft attention.
 16. The system of claim 1, wherein the multi-layer deep long short-term memory process utilizes a cross-entropy loss function.
 17. The system of claim 15, wherein the cross-entropy loss function includes a function selected from the group consisting of a last time point cross-entropy loss function, an all-time point cross-entropy loss function, and a max-neighbor cross-entropy loss function.
 18. The system of claim 1, wherein the within-frame attention includes a multilayer perceptron feeding into a softmax layer.
 19. A computer-implemented method for home security, the method comprising: monitoring an area with a camera; capturing, by the camera, live video to provide a live video stream; detecting and identifying, by a processor, a target action sequence in the live video stream using a multi-layer deep long short-term memory process on an attention factor that is based on a within-frame attention and a between-frame attention; and triggering, by the processor, an action to alert that a target action sequence has been detected.
 20. The method of claim 19, wherein the within-frame attention and the between-frame attention include an attention selected from the group consisting of a hard attention and a soft attention.